---
layout: post
title: Benchmarks of speed (Numpy vs all)
date: '2015-01-06T14:35:00.001-08:00'
author: Alex Rogozhnikov
tags:
- Machine Learning
- vectorization
- Theano
- Python
- Optimization
modified_time: '2015-07-08T08:04:40.009-07:00'
blogger_id: tag:blogger.com,1999:blog-307916792578626510.post-8966676935851717392
blogger_orig_url: http://brilliantlywrong.blogspot.com/2015/01/benchmarks-of-speed-numpy-vs-all.html
---
<div>Personally I am a big fan of <strong>numpy</strong> package, since it makes the code clean and still quite fast.<br/>
However I am much worried about the speed, so decided to collect different benchmarks<br/>
	<br/>
	<strong>numpy</strong> vs <strong>cython</strong> vs <strong>weave</strong> (numpy is about 2 times slower than others)<br/>
(posted in 2011)<br/>
	<a href="http://technicaldiscovery.blogspot.ru/2011/06/speeding-up-python-numpy-cython-and.html">http://technicaldiscovery.blogspot.ru/2011/06/speeding-up-python-numpy-cython-and.html</a><br/>
	<br/>
Primarily the post is about numba, the pairwise distances are computed with <strong>cython</strong>, <strong>numpy</strong>, <strong>numba</strong>.<br/>
	<strong>Numba</strong> is claimed to be the fastest, around 10 times faster than numpy.<br/>
(posted in 2013)<br/>
	<a href="https://jakevdp.github.io/blog/2013/06/15/numba-vs-cython-take-2/">https://jakevdp.github.io/blog/2013/06/15/numba-vs-cython-take-2/</a><br/>
	<br/>
	<strong>Julia</strong> is claimed by its developers to be very fast language.<br/>
Well, if that is true, there would be no need in writing easily-vectorizable operation in pure python (yep, I mean they are simply cheating), currently these 'benchmarks' are at the main page<br/>
	<a href="http://julialang.org/">http://julialang.org/</a><br/>
	<br/>
The following post was written in 2011, the problem observed is solution of Laplace equation as usual.<br/>
Numpy is ~10 times slower than pure C++ solution (and equal to matlab), <strong>weave</strong> shows the speed comparable with pure C++ (I don't know anyone using weave now)<br/>
	<a href="http://wiki.scipy.org/PerformancePython">http://wiki.scipy.org/PerformancePython</a><br/>
	<br/>
Fresh (2014) benchmark of different python tools, simple vectorized expression A*B-4.1*A &gt; 2.5*B is evaluated with <strong>numpy, cython, numba, numexpr, and parakeet</strong> (and two latest are the fastest - about 10 times less time than numpy, achieved by using multithreading with two cores)<br/>
	<a href="http://nbviewer.ipython.org/github/rasbt/One-Python-benchmark-per-day/blob/master/ipython_nbs/day7_2_jit_numpy.ipynb">http://nbviewer.ipython.org/github/rasbt/One-Python-benchmark-per-day/blob/master/ipython_nbs/day7_2_jit_numpy.ipynb</a><br/>
	<br/>
Haven't found any general-purpose <strong>theano vs numpy </strong>benchmarks, but in the article there is comparison of neural networks and theano is expected to give much better speed than numpy/torch(c++)/matlab, specially it is fast on GPU<br/>
	<a href="http://conference.scipy.org/proceedings/scipy2010/pdfs/bergstra.pdf">http://conference.scipy.org/proceedings/scipy2010/pdfs/bergstra.pdf</a><br/>
	<br/>
	<br/>
One more detailed review of numpy vs cython vs c (held in 2014)<br/>
	<a href="http://notes-on-cython.readthedocs.org/en/latest/std_dev.html">http://notes-on-cython.readthedocs.org/en/latest/std_dev.html</a><br/>
Let me copy-paste the example results (computation of std).<br/>
	<br/>
	<table class="comparison dataframe">
		<thead>
			<tr>
				<th>Method</th>
				<th>Time (ms)</th>
				<th>Compared to Python</th>
				<th>Compared to Numpy</th>
			</tr>
		</thead>
		<tbody>
			<tr>
				<td>Pure Python</td>
				<td>183</td>
				<td>x1</td>
				<td>x0.03</td>
			</tr>
			<tr>
				<td>Numpy</td>
				<td>5.97</td>
				<td>x31</td>
				<td>x1</td>
			</tr>
			<tr>
				<td>Naive Cython</td>
				<td>7.76</td>
				<td>x24</td>
				<td>x0.8</td>
			</tr>
			<tr>
				<td>Optimised Cython</td>
				<td>2.18</td>
				<td>x84</td>
				<td>x2.7</td>
			</tr>
			<tr>
				<td>Cython calling C</td>
				<td>2.22</td>
				<td>x82</td>
				<td>x2.7</td>
			</tr>
		</tbody>
	</table>
	<br/>
	<br/>
Simple, but comprehensive comparison of python accelerators was prepared by Jake Vanderplaas and Olivier Grisel:<br/>
	<a href="http://nbviewer.ipython.org/github/ogrisel/notebooks/blob/master/Numba%20Parakeet%20Cython.ipynb">http://nbviewer.ipython.org/github/ogrisel/notebooks/blob/master/Numba%20Parakeet%20Cython.ipynb</a><br/>
Problem is computation of pairwise distances. The results this time seem to be unbiased:<br/>
	<br/>
	<table class="comparison dataframe">
		<tbody>
			<tr>
				<td>Python</td>
				<td>9.51s</td>
			</tr>
			<tr>
				<td>Naive numpy</td>
				<td>64.7 ms</td>
			</tr>
			<tr>
				<td>Numba</td>
				<td>6.72ms</td>
			</tr>
			<tr>
				<td>Cython</td>
				<td>6.57ms</td>
			</tr>
			<tr>
				<td>Parakeet</td>
				<td>12.3 ms</td>
			</tr>
			<tr>
				<td>Cython</td>
				<td>6.57 ms</td>
			</tr>
		</tbody>
	</table>
	<br/>
	<br/>
	<br/>
By the way, I was recently looking for a slow place in my code, and it totally dropped of my mind that <em>computation of residual is long operation</em>:<br/>
	<a href="http://embeddedgurus.com/stack-overflow/2011/02/efficient-c-tip-13-use-the-modulus-operator-with-caution/">http://embeddedgurus.com/stack-overflow/2011/02/efficient-c-tip-13-use-the-modulus-operator-with-caution/</a><br/>
Integer division / remainder computation time significaly depends on the bit depth (16 bit operation are ~2 times faster than 32 bit ones, division takes roughly 10 times more time than multiplication).<br/>
	<br/>
Surprising: multiplication/addition/binary operations take the same time (compared on numpy 1.9), which I personally find very strange (this behavior may be caused by quite complicated addressing in numpy arrays, which becomes the bottleneck, but this is only a guess).
</div>